graph LR
A["Traditional Benchmarks<br/>(MMLU, etc.)<br/>90%+ accuracy"] --> B["Benchmark<br/>Saturation"]
B --> C["GPQA Diamond<br/>198 PhD-level questions<br/>Experts: 65%"]
C --> D["Meaningful signal<br/>for frontier AI<br/>reasoning"]
style A fill:#e74c3c,stroke:#333,color:#fff
style B fill:#f39c12,stroke:#333,color:#fff
style C fill:#27ae60,stroke:#333,color:#fff
style D fill:#3498db,stroke:#333,color:#fff
GPQA Diamond
A graduate-level, Google-proof science benchmark where PhD experts reach only 65% — and frontier AI models now surpass them
Keywords: GPQA Diamond, AI benchmark, graduate-level science QA, Google-proof questions, PhD-level evaluation, frontier LLM benchmark, physics chemistry biology, expert-level reasoning, COLM 2024, NYU benchmark

Introduction
Most AI benchmarks, even challenging ones like MMLU, have been saturated by frontier models, with leading systems scoring over 90%. At that level they can no longer distinguish between state-of-the-art models or measure genuine scientific reasoning.
GPQA Diamond is different. It is the hardest, most vetted subset of the Graduate-Level Google-Proof QA Benchmark — 198 multiple-choice questions in biology, physics, and chemistry so difficult that PhD-level domain experts only reach 65% accuracy. Non-expert validators, given over 30 minutes with full internet access, score only 34% — making these questions truly “Google-proof.”
“We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: even highly skilled non-expert validators only reach 34% accuracy, despite spending over 30 minutes with unrestricted access to the web.” — GPQA Paper
What Is GPQA Diamond?
GPQA (Graduate-Level Google-Proof QA) is a benchmark of expert-written multiple-choice questions across three science domains, released in three subsets of increasing quality and difficulty:
| Subset | Questions | Description |
|---|---|---|
| GPQA Extended | 546 | All collected questions including lower-quality ones |
| GPQA Main | 448 | Filtered for quality and difficulty |
| GPQA Diamond | 198 | Hardest subset — both expert validators answered correctly AND most non-expert validators did not |
Why “Diamond”?
The Diamond subset applies the strictest quality filter: a question is included only if both expert validators (PhD-level domain experts independent of the question writer) answered it correctly, while the majority of non-expert validators answered it incorrectly. This double validation ensures every question is:
- Unambiguously correct — verified by two independent experts
- Genuinely difficult — not solvable through surface-level reasoning or web search
- High signal — provides maximum information about model capabilities
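The paper's Diamond inclusion rule (both expert validators answer correctly, most non-expert validators do not) can be sketched as a simple filter. The field names below are illustrative, not the released dataset's actual schema:

```python
def diamond_filter(questions):
    """Keep questions both expert validators got right and most non-experts missed.
    Field names are illustrative, not the released dataset's schema."""
    kept = []
    for q in questions:
        experts_ok = all(q["expert_correct"])
        # strict majority of non-expert validators answered incorrectly
        nonexperts_wrong = 2 * sum(q["nonexpert_correct"]) < len(q["nonexpert_correct"])
        if experts_ok and nonexperts_wrong:
            kept.append(q)
    return kept

candidates = [
    {"id": 1, "expert_correct": [True, True],  "nonexpert_correct": [False, False, True]},
    {"id": 2, "expert_correct": [True, False], "nonexpert_correct": [False, False, False]},
]
diamond = diamond_filter(candidates)  # only question 1 survives both checks
```

Question 2 is dropped even though every non-expert missed it, because one expert validator also got it wrong: difficulty alone is not enough for Diamond.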
Key Characteristics
| Feature | Details |
|---|---|
| Total questions | 198 (Diamond subset) |
| Domains | Biology, Physics, Chemistry |
| Question type | Multiple-choice (4 options) |
| Expert accuracy | 65% (PhD-level domain experts) |
| Non-expert accuracy | 34% (with 30+ minutes and full web access) |
| Original GPT-4 baseline | 39% (November 2023) |
| License | CC-BY-4.0 |
What Makes It “Google-Proof”?
graph TD
Q["PhD-level science<br/>question posed"] --> E["Domain Expert<br/>(PhD holder)<br/>65% accuracy"]
Q --> N["Non-Expert Validator<br/>(30+ min, full web)<br/>34% accuracy"]
Q --> M["GPT-4<br/>(Nov 2023 baseline)<br/>39% accuracy"]
E --> V{"Both expert validators correct,<br/>most non-experts wrong?"}
N --> V
V -->|Yes| D["Included in<br/>GPQA Diamond"]
V -->|No| X["Excluded from<br/>Diamond subset"]
style Q fill:#8e44ad,color:#fff,stroke:#333
style E fill:#27ae60,color:#fff,stroke:#333
style N fill:#e74c3c,color:#fff,stroke:#333
style M fill:#f39c12,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style X fill:#95a5a6,color:#fff,stroke:#333
The term “Google-proof” means that non-expert validators — intelligent individuals without domain-specific PhD training — cannot solve these questions even with unlimited internet access. The questions require deep conceptual understanding, multi-step reasoning, and expert-level domain knowledge that cannot be pieced together from search results alone.
Who Built It?
GPQA was developed at New York University (NYU) by:
- David Rein — Lead author
- Betty Li Hou, Asa Cooper Stickland, Jackson Petty — Core researchers
- Richard Yuanzhe Pang, Julien Dirani, Julian Michael — Contributing researchers
- Samuel R. Bowman — Senior advisor (NYU)
Publication
GPQA was published at the First Conference on Language Modeling (COLM 2024), one of the premier venues for language model research.
| Resource | Link |
|---|---|
| arXiv paper | arxiv.org/abs/2311.12022 |
| GitHub repository | github.com/idavidrein/gpqa |
| Hugging Face dataset | huggingface.co/datasets/Idavidrein/gpqa |
| Conference | COLM 2024 (First Conference on Language Modeling) |
What Skills Does It Test?
GPQA Diamond tests deep expert-level scientific reasoning — not surface-level knowledge retrieval.
graph TD
GPQA["GPQA Diamond<br/>198 questions"] --> P["Physics<br/>Quantum mechanics,<br/>thermodynamics,<br/>relativity"]
GPQA --> C["Chemistry<br/>Organic reactions,<br/>spectroscopy,<br/>molecular structure"]
GPQA --> B["Biology<br/>Molecular biology,<br/>genetics,<br/>biochemistry"]
style GPQA fill:#e74c3c,color:#fff,stroke:#333
style P fill:#3498db,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style B fill:#f39c12,color:#fff,stroke:#333
| Capability | What GPQA Diamond Tests |
|---|---|
| Graduate-level knowledge | Questions require PhD-level understanding in specific subfields |
| Multi-step reasoning | Most questions demand chaining multiple concepts together |
| Resistance to search | Answers cannot be found via web search — they require deep understanding |
| Cross-domain synthesis | Some questions span subdisciplines within a field |
| Calibration | Whether models can accurately assess their own confidence |
Example Difficulty
A typical GPQA Diamond question might ask about the outcome of a specific quantum mechanical calculation, the product of a multi-step organic synthesis, or the implications of a particular genetic regulatory mechanism — requiring graduate-level coursework and research experience to answer correctly.
Current Leaderboard
The table below compiles GPQA Diamond accuracy scores from official model announcements and technical reports. All scores are pass@1 (single attempt) unless otherwise noted.
Sources: OpenAI model announcements (o1 blog, o3-mini blog), Google DeepMind (Gemini 2.5 blog), Anthropic model cards, original GPQA paper. Consulted July 2025.
| Rank | Model | Accuracy (%) | Source |
|---|---|---|---|
| — | Human domain experts (PhDs) | 65.0 | GPQA paper |
| — | Non-expert validators (30+ min, web) | 34.0 | GPQA paper |
| 1 | o1 (OpenAI) | 77.3 | OpenAI o1 blog |
| 2 | o3-mini (high) (OpenAI) | 77.0 | OpenAI o3-mini blog |
| 3 | o1-preview (OpenAI) | 73.3 | OpenAI o1 blog |
| 4 | GPT-4o (OpenAI) | 50.6 | OpenAI o1 blog |
| 5 | GPT-4 (OpenAI, 2023 baseline) | 39.0 | GPQA paper |
Key takeaways:
- o1 was the first AI model to surpass human PhD experts on GPQA Diamond (77.3% vs. 65%), a milestone highlighted by OpenAI
- o3-mini (high) matches o1 performance at significantly lower cost
- The gap between non-experts with web access (34%) and experts (65%) confirms questions are genuinely “Google-proof”
- Even GPT-4o (50.6%) falls short of PhD expert performance, despite being far more capable than the original GPT-4 baseline
Note: More recent models — including o3, Gemini 2.5 Pro, Claude 3.7 Sonnet (extended thinking), and DeepSeek-R1 — have also been evaluated on GPQA Diamond. Google reports Gemini 2.5 Pro as “state-of-the-art” on GPQA. For the latest results, consult the resources listed below.
Where to Explore the Benchmark
Dataset and Code
| Resource | Description | Link |
|---|---|---|
| Hugging Face Dataset | Full GPQA dataset (Main, Extended, Diamond splits) | huggingface.co/datasets/Idavidrein/gpqa |
| GitHub Repository | Evaluation code, baselines, and documentation | github.com/idavidrein/gpqa |
| arXiv Paper | Full technical paper with methodology and analysis | arxiv.org/abs/2311.12022 |
Load the Dataset
from datasets import load_dataset

# "gpqa_diamond" selects the 198-question Diamond split; the dataset is gated,
# so Hugging Face authentication may be required.
dataset = load_dataset("Idavidrein/gpqa", "gpqa_diamond")

Understanding the Metrics
Pass@1 Accuracy
The primary metric. Each question is a 4-option multiple-choice problem. The model produces a single answer, and accuracy is the fraction of correct responses. Random baseline is 25%.
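A pass@1 evaluation loop over GPQA-style records might look like the sketch below. The answer fields mirror the Hugging Face dataset's columns, which should be verified against the dataset card, and the answer function is a stand-in for a real model call:

```python
import random

# Toy records mimicking the GPQA answer fields; verify the real column names
# against the Hugging Face dataset card before relying on them.
records = [
    {"Question": "Which particle mediates the strong force?",
     "Correct Answer": "Gluon",
     "Incorrect Answer 1": "Photon",
     "Incorrect Answer 2": "W boson",
     "Incorrect Answer 3": "Graviton"},
]

def score_pass_at_1(records, answer_fn, seed=0):
    """Fraction of questions answered correctly in a single attempt."""
    rng = random.Random(seed)
    correct = 0
    for rec in records:
        options = [rec["Correct Answer"],
                   rec["Incorrect Answer 1"],
                   rec["Incorrect Answer 2"],
                   rec["Incorrect Answer 3"]]
        rng.shuffle(options)  # shuffle so the correct answer's position carries no signal
        choice = answer_fn(rec["Question"], options)
        correct += choice == rec["Correct Answer"]
    return correct / len(records)

# A stand-in "model" that always picks the first option; with shuffled options
# this is random guessing, converging to the 25% baseline over many questions.
baseline = score_pass_at_1(records, lambda q, opts: opts[0])
```

In a real harness, `answer_fn` would format the question and shuffled options into a prompt and parse the model's chosen letter back to an option string.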
Consensus@64
Some evaluations (notably OpenAI’s) also report consensus@64: the model generates 64 responses per question, and the final answer is selected by majority vote. This measures how much accuracy improves when run-to-run variance is averaged out through repeated sampling.
| Model | Pass@1 | Consensus@64 |
|---|---|---|
| GPT-4o | 50.6% | 56.1% |
| o1-preview | 73.3% | 78.3% |
| o1 | 77.3% | 78.0% |
Key insight: The small gap between pass@1 and consensus@64 for o1 (77.3% vs. 78.0%) suggests the model’s answers are highly consistent — it either knows or doesn’t know, with little variance.
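The majority-vote step behind consensus@64 can be sketched in a few lines. This is a toy illustration of the voting rule, not OpenAI's evaluation harness:

```python
from collections import Counter

def consensus_answer(samples):
    """Majority vote over repeated samples; ties go to the first-seen answer."""
    return Counter(samples).most_common(1)[0][0]

# 64 sampled answers for one question: the model wavers between options but
# leans toward "C", so the consensus answer is "C" despite the disagreement.
samples = ["C"] * 40 + ["B"] * 15 + ["A"] * 9
chosen = consensus_answer(samples)
```

This is why a consistent model like o1 gains little from consensus voting: when nearly all 64 samples already agree, the majority vote rarely flips the single-sample answer.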
Why GPQA Diamond Matters
graph LR
A["Expert-level<br/>difficulty"] --> C["GPQA Diamond<br/>as a yardstick"]
B["Google-proof<br/>questions"] --> C
C --> D["Measures genuine<br/>scientific reasoning"]
C --> E["Human-AI<br/>comparison point"]
style A fill:#e74c3c,color:#fff,stroke:#333
style B fill:#e74c3c,color:#fff,stroke:#333
style C fill:#27ae60,color:#fff,stroke:#333
style D fill:#3498db,color:#fff,stroke:#333
style E fill:#3498db,color:#fff,stroke:#333
- First benchmark where AI surpassed PhD experts — o1’s 77.3% vs. experts’ 65% was a landmark moment for AI capabilities
- Measures deep reasoning, not retrieval — “Google-proof” design ensures models must actually understand the science
- Standard evaluation for frontier models — reported by every major AI lab in model release announcements
- Clear human baseline — the 65% expert ceiling provides a meaningful reference point
- Focused on STEM — targets the science domains most relevant to AI safety and capability concerns
Conclusion
GPQA Diamond stands as one of the most important benchmarks in AI evaluation:
- 198 rigorously vetted questions in biology, physics, and chemistry — double-validated by independent PhD experts
- PhD-level domain experts score only 65% — and non-experts with full web access score just 34%
- The first benchmark where AI surpassed human experts — OpenAI’s o1 reached 77.3%, crossing the 65% expert threshold
- Built at NYU and published at COLM 2024, establishing it as a peer-reviewed standard
- Remains a key differentiator for frontier models — reported in every major model release
As reasoning-focused models continue to improve, GPQA Diamond provides a critical measure of whether AI systems possess genuine scientific understanding — not just the ability to pattern-match answers from training data.
References
- Rein, D., Hou, B.L., Stickland, A.C., Petty, J., Pang, R.Y., Dirani, J., Michael, J., & Bowman, S.R. “GPQA: A Graduate-Level Google-Proof Q&A Benchmark.” First Conference on Language Modeling (COLM), 2024. arxiv.org/abs/2311.12022
- OpenAI. “Learning to reason with LLMs.” September 2024. openai.com/index/learning-to-reason-with-llms
- OpenAI. “OpenAI o3-mini.” January 2025. openai.com/index/openai-o3-mini
- Google DeepMind. “Gemini 2.5: Our most intelligent AI model.” March 2025. blog.google/technology/google-deepmind/gemini-model-thinking-updates-march-2025
- Rein, D. “GPQA Dataset.” Hugging Face. huggingface.co/datasets/Idavidrein/gpqa
- Rein, D. “GPQA GitHub Repository.” github.com/idavidrein/gpqa
Read More
- Explore another frontier benchmark — see Humanity’s Last Exam (HLE)
- Test physical commonsense across 116 languages — see Global PIQA
- Track model costs when running evaluations — see FinOps Best Practices for LLM Applications
- Deploy models for running your own evaluations — see Deploying and Serving LLM with vLLM
- Understand quantization trade-offs for evaluation — see Quantization Methods for LLMs